# load utility packages for later
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Chapter 3 - Clustering
1 Introduction
Clustering is an Unsupervised Learning technique. The aim is to group our data points into classes/clusters based on their similarity/distance. We want to maximize the similarity of data grouped together and the separation/distance from those data points that are not similar.
These groups that we assign to the data points are called clusters, and are based on measuring how alike data points are in our data space. How we measure this alikeness is down to the method of clustering used, most of which use the distance between data points in some manner.
Clustering allows us to find structures and similarities in data by using simple rules to group data together. Doing so can tell us about hidden properties in the data and groups that may not intuitively be obvious.
Whereas in supervised learning we used attributes of data to predict classes (based on a trained classification model), in clustering we will use attributes of data to create classes. Our algorithms may not be able to produce names for our new classes, but their existence can lead to a greater understanding of the data itself.
1.1 Learning Outcomes
- Understand how distances can be measured.
- Be comfortable applying K-Means clustering.
- Understand how clustering performance can be measured, intrinsically and extrinsically.
- Be comfortable applying Density based Clustering.
- Be comfortable applying Hierarchical Clustering.
- Be able to create dendrograms.
2 Measuring Distance
Measuring the distance between two data points was briefly discussed in Case Study A from Introduction to Machine Learning, used in the K-Nearest Neighbour Algorithm.
Quantifying the distance between data points is fundamental to clustering techniques, therefore, we will introduce how to do this more formally in this chapter.
In the “real world” we have a clear, intuitive understanding of distance. However, physical distance is only one of several ways of measuring distance that exist for data.
2.1 Metrics
We are going to define a metric as a function that measures the distance between two points, whilst having several mathematical properties.
The properties we are going to discuss are necessary so that the ways we quantify the differences between points are valid.
Without the properties we could come up with any mathematical rule and count it as a distance function without it having any real meaning.
We are going to denote the distance function as \(\mathbf{d}\) and the two points in question are \(a\) and \(b\).
Therefore the distance between \(a\) and \(b\) is:
\[\mathbf{d}(a,b)\]
The properties our distance function must have are:
2.1.1 Symmetry
The measurements of the distance must be symmetric, so which point is in each position does not change the distance. The distance from \(a\) to \(b\) must be the same as the distance from \(b\) to \(a\).
\[\mathbf{d}(a,b) = \mathbf{d}(b,a)\]
2.1.2 Non-Negativity
Intuitively, we cannot have negative distance measurements, so our distance will always be greater than or equal to zero.
\[\mathbf{d}(a,b) \geq 0\]
2.1.3 Equivalence
If two points are at the same position in space then they are equivalent. This means that if we have duplicate data the distance between the points will be zero, and whenever the distance is zero, the points are equivalent.
\[\mathbf{d}(a,b) = 0 \iff a = b\]
2.1.4 Triangular inequality
This final property states that the direct distance between two points is shorter than or equal to any distance via an intermediary point.
\[\mathbf{d}(a,b) \leq \mathbf{d}(a,c) + \mathbf{d}(c,b)\]
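These four properties can be checked numerically for a candidate function. Below is a minimal sketch using the familiar Euclidean distance (defined formally in the next section); note that checking at specific points is a sanity check, not a proof.

```python
import numpy as np

def d(a, b):
    # Euclidean distance between two points
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

a, b, c = [0, 0], [3, 4], [1, 1]

assert np.isclose(d(a, b), d(b, a))   # symmetry
assert d(a, b) >= 0                   # non-negativity
assert np.isclose(d(a, a), 0)         # equivalence (distance to itself is zero)
assert d(a, b) <= d(a, c) + d(c, b)   # triangular inequality
print(d(a, b))  # 5.0
```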
2.2 Distance functions
Now that we know what makes a valid distance function we are going to look at some commonly used measures.
2.2.1 Euclidean Distance
The first is the euclidean distance, which is the straight-line distance between two points in “euclidean space”.
We can calculate the euclidean distance by finding the square root of the sum of squared differences in each of the \(n\) dimensions/coordinates of the data.
The euclidean distance is often referred to as the \(L_2\) distance or “norm”, we will see why later.
\[L_2 = \mathbf{d}(\mathbf{a},\mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2}\]
\[L_2 = \mathbf{d}(\mathbf{a},\mathbf{b}) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2} \]
2.2.2 Manhattan Distance
This second method for measuring distance takes its name from the street grid layout of Manhattan in New York, as well as many other cities. The intuition for this method is that distance travelled cannot always be described by the shortest path in space, but rather has to follow grid lines.
The Manhattan distance is also called the Taxicab metric or snake distance and we will use the notation \(L_1\).
There can often be more than one path between two points which have the same Manhattan Distance.
The distance is calculated as the sum of the lengths between the two points in each dimension. This uses the absolute value of the differences rather than their squares, and is given by:
\[ L_1 = \mathbf{d}(\mathbf{a},\mathbf{b}) = |a_1 - b_1| + |a_2 - b_2| + ... + |a_n - b_n| \]
\[ L_1 = \mathbf{d}(\mathbf{a},\mathbf{b}) = \sum_{i=1}^{n} |a_i - b_i| \]
2.2.3 Generalise
We can generalise the \(L_1\) and \(L_2\) norms into one equation used for measuring distances. This is known as the Minkowski distance, where we have the parameter \(p\), which specifies the type of distance used.
For the Manhattan distance \(p = 1\).
For the Euclidean distance \(p = 2\).
\[ \mathbf{d}(\mathbf{a},\mathbf{b}) = \left(\sum_{i = 1}^{n} |a_i - b_i|^p\right)^{\frac{1}{p}}\]
We will be looking at these two methods for measuring distance throughout the course as they can be implemented in different clustering methods.
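A short sketch of the Minkowski distance makes the relationship concrete: with \(p=1\) it reproduces the Manhattan distance and with \(p=2\) the Euclidean distance. The points used here are illustrative.

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: sum |a_i - b_i|^p over dimensions, then take the p-th root
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)

a, b = [2, 2], [7, 7]

print(minkowski(a, b, p=1))  # Manhattan: |2-7| + |2-7| = 10.0
print(minkowski(a, b, p=2))  # Euclidean: sqrt(25 + 25) ≈ 7.0711
```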
2.3 Exercise 1
Given the coordinates of the points **A** and **B** below, change the coordinates of **B** such that the euclidean distance is less than the manhattan distance, but the difference is less than **0.1**. For what kinds of point pairs is the difference small?
from sklearn.metrics.pairwise import manhattan_distances, euclidean_distances
#set the coordinate for A
A = np.array([[2, 2]])
#you can change the values for B to answer the question
B = np.array([[2, 7]])
#join arrays for easier plotting
euclidean_array = np.concatenate([A,B], axis=0)
#find a corner for manhattan visualisation
corner = np.array([[A[0][0] + (B - A)[0][0], A[0][1]]])
manhattan_array = np.concatenate([A, corner, B], axis=0)
#supporting plot to help deduce positioning
plt.xlim(0, 10), plt.ylim(0, 10)
plt.text(A[0][0]-0.5, A[0][1], s="A")
plt.text(B[0][0]-0.5, B[0][1], s="B")
plt.plot(euclidean_array[:,0], euclidean_array[:,1], "bo-",label="euclidean")
plt.plot(manhattan_array[:,0], manhattan_array[:,1], "go-", label="manhattan")
plt.legend()
#calculate distances
euclidean, manhattan = euclidean_distances(A, B), manhattan_distances(A, B)
print("Euclidean distance between A, B:", euclidean[0][0].round(4))
print("Manhattan distance between A, B:", manhattan[0][0].round(4))
print("Difference in distance metrics:", (manhattan - euclidean)[0][0].round(4))
Euclidean distance between A, B: 5.0
Manhattan distance between A, B: 5.0
Difference in distance metrics: 0.0
The euclidean distance is by definition the shortest distance between two points. Another distance metric will generally differ from the euclidean; whether that difference is useful to us is for us to judge.
The difference between the two metrics is small when the data points differ along only one axis (in 2D). More generally, the more dimensions in which the values are equal, the closer the two metrics will be.
The difference between the manhattan and euclidean distances grows as the points differ in more dimensions, because the manhattan distance accumulates the full difference in every dimension while the euclidean distance grows more slowly.
For this reason the manhattan distance can be more useful when working with high-dimensional data, with many columns.
from sklearn.metrics.pairwise import manhattan_distances, euclidean_distances
# set the coordinate for A
A = np.array([[2, 2]])
# you can change the values for B to answer the question
B = np.array([[7, 7]])
# the below code will plot and calculate values to help
# answer the question. You do not need to understand it in detail
# it is suggested to not change anything below
# join arrays for easier plotting
euclidean_array = np.concatenate([A,B], axis=0)
# find a corner for manhattan visualisation
corner = np.array([[A[0][0] + (B - A)[0][0], A[0][1]]])
manhattan_array = np.concatenate([A, corner, B], axis=0)
# supporting plot to help deduce positioning
plt.xlim(0, 10), plt.ylim(0, 10)
plt.text(A[0][0]-0.5, A[0][1], s="A")
plt.text(B[0][0]-0.5, B[0][1], s="B")
plt.plot(euclidean_array[:,0], euclidean_array[:,1], "bo-",label="euclidean")
plt.plot(manhattan_array[:,0], manhattan_array[:,1], "go-", label="manhattan")
plt.legend()
# calculate distances
euclidean, manhattan = euclidean_distances(A, B), manhattan_distances(A, B)
print("Euclidean distance between A, B:", euclidean[0][0].round(4))
print("Manhattan distance between A, B:", manhattan[0][0].round(4))
print("Difference in distance metrics:", (manhattan - euclidean)[0][0].round(4))
Euclidean distance between A, B: 7.0711
Manhattan distance between A, B: 10.0
Difference in distance metrics: 2.9289
3 KMeans Clustering
The first clustering algorithm we will explore is the K-Means method. The K refers to how many clusters we want the algorithm to find for us.
We specify how many clusters we want to find, and the algorithm works out which data points belong to which cluster.
At the end of the algorithm, each data point will be a member of a class. The algorithm will not tell us what the different classes mean, but we can interpret the meaning using domain knowledge.
This method uses the mean position of each cluster to assign membership of the cluster, we will call this mean position a centroid.
3.1 Algorithm
Input
- \(k\) - number of clusters desired
- \(X\) - training data of \(n\) rows and \(p\) columns
Step 1 - Initialise Centroids
Select \(k\) initial samples of the data as the starting centroids. This can be done either randomly or using a more complex algorithm approach.
Step 2 - Assign To Nearest Centroid
Calculate the euclidean (in this case) distance between all points and each centroid. Assign each point to its nearest centroid.
Step 3 - Calculate New Centroids
Now that we have each data point as a member of a class, we can calculate the new centroid of that class as the mean position of all data points with that class.
Step 4 - Repeat Steps 2 & 3
Now that we have the new centroid values we can reassign each data point to a class based on its distance to each newly calculated centroid. Once the assignment takes place, new centroids are calculated, and so on.
This process continues for some specified number of iterations or until some convergence condition is met.
The output of this algorithm (which data points belong to which final class) depends on the initial selection (seed) of centroids. For this reason the algorithm should be run multiple times with different initialisations and the outputs compared, to avoid settling on a poor local optimum.
How many steps this algorithm will take to run is dependent on several factors. These are:
- \(n\) - The number of data points
- \(K\) - The number of clusters
- \(d\) - How many dimensions (axes) each data point has
- \(i\) - The number of iterations until convergence (how many repeats, determined by how well clustered the data is)
The runtime of the algorithm is linear in \(n\), \(K\), \(d\) and \(i\); as any of these increases, so does the runtime.
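The four steps above can be sketched directly in NumPy. This is a minimal illustration only (empty clusters simply keep their previous centroid, and initialisation is purely random); the sklearn implementation used in the next section is more robust.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 - initialise centroids as k random samples of the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2 - assign each point to its nearest centroid (euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 - recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4 - repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups: the sketch recovers them
X_demo = np.array([[0., 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, _ = kmeans_sketch(X_demo, k=2)
print(labels)
```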
3.2 Using K-Means
As with most machine learning algorithms we do not need to implement this method ourselves, sklearn provides a class with many parameters we can use.
We are going to look at another toy data set, clusters.csv, to explore the algorithm.
Please ensure you load the packages below
from sklearn.preprocessing import StandardScaler
generic_data = np.loadtxt("../../data/clusters.csv", delimiter=",")
# We have two data columns and a true class column
generic_data.shape
(300, 3)
We are going to separate the data and the true label values for now. We will also standard scale the data to make processing easier.
# Get the first two columns of the data and scale it
X = StandardScaler().fit_transform(generic_data[:,:-1])
labels_true = generic_data[:,-1]
Let’s have a look at the data distribution over the two features.
# Create a density plot for both dimensions
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8,4), sharey=True)
# Plot distribution of each feature
axes[0].hist(x=X[:,0], bins=20, density=True, label="$x_1$", color="rebeccapurple")
axes[1].hist(x=X[:,1], bins=20, density=True, label="$x_2$", color="dodgerblue")
axes[0].set_ylabel("Density")
axes[0].set_xlabel("$x_1$")
axes[1].set_xlabel("$x_2$")
fig.suptitle("Distribution of each dimension")
fig.legend();
In each dimension there are two clear groupings of the data. One grouping in each dimension has a higher density than the other. Plotting the data this way hides some of the structure of the data, so we can instead plot the dimensions against each other to understand the relationship between the two.
# Plot the data in 2 dimensions
plt.figure(figsize=(5,5))
plt.scatter(x=X[:,0], y=X[:,1], c="navy")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$");
Looking at the data it appears that there are 3 groupings of data, so we will select \(k = 3\).
from sklearn.cluster import KMeans
k = 3
# Fit clusterer using basic parameters
km_clusterer = KMeans(n_clusters=k,
init="random", # method for selecting initial cluster points
max_iter=300, # iterations before stopping
random_state=123).fit(X)
labels_found_3 = km_clusterer.labels_
# We find as many labels as we give rows of X
labels_found_3.shape
(300,)
# Our labels are the group found for each data point
labels_found_3
array([2, 2, 1, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1,
0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 2, 2, 2, 1, 2, 2, 2, 0, 2, 0, 0, 1,
1, 2, 1, 1, 0, 1, 2, 1, 2, 1, 1, 2, 0, 2, 0, 2, 0, 2, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 1, 1, 0, 1, 1, 2, 1, 1, 1,
0, 0, 1, 1, 2, 0, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 1, 2, 2, 1, 2, 2,
0, 1, 2, 2, 0, 0, 0, 1, 1, 2, 1, 0, 2, 0, 1, 2, 2, 2, 2, 1, 0, 1,
0, 0, 0, 0, 1, 2, 1, 1, 1, 1, 2, 0, 0, 0, 0, 2, 2, 0, 0, 1, 0, 0,
2, 2, 2, 2, 0, 1, 1, 1, 1, 2, 1, 1, 2, 1, 0, 0, 1, 2, 1, 0, 2, 2,
2, 0, 0, 2, 2, 2, 0, 0, 1, 2, 2, 2, 0, 2, 1, 2, 2, 2, 2, 0, 0, 2,
2, 0, 0, 2, 1, 1, 0, 2, 2, 0, 0, 0, 2, 1, 0, 2, 0, 0, 1, 2, 1, 1,
2, 0, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 2, 0, 1,
0, 1, 0, 1, 0, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 0, 1, 0, 2,
2, 1, 2, 1, 1, 1, 0, 1, 0, 2, 1, 0, 2, 2, 1, 2, 0, 2, 0, 2, 2, 0,
2, 0, 1, 0, 1, 2, 1, 0, 1, 2, 2, 2, 1, 1])
Plotting the clusters that have been found:
fig, ax = plt.subplots(figsize=(5,5))
# Give each label value a different colour to show which value
# was found in clustering.
label_colour_pairs = [(0,"green"), (1,"blue"), (2,"orange")]
for label_value, label_colour in label_colour_pairs:
# Plot all of the data points with a specific label value
X_clust = X[labels_found_3 == label_value]
# Colour based on label value
ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax.set_title("Found Cluster Labels $K=3$")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend();
The actual classes are given below. It’s important to note that the clustering did not predict which point is in which target class, but rather grouped the points into classes of its own.
Each label for a cluster doesn’t have an inherent meaning. The below graph shows the target’s true classes. Note that there is some difference between the found clusters in the data, and the true label values.
In clustering you do not have a true value / target as the purpose is to understand the natural grouping of data based on distances between points. We are using the true target values below for learning purposes.
fig, ax = plt.subplots(figsize=(5,5))
# Note - the model and true values order of classes may differ
# True class: 0 != cluster: 0.
# Each time we run the clustering we may get different orders
label_colour_pairs = [(0,'royalblue'), (1,'goldenrod'), (2,'seagreen')]
for label_value, label_colour in label_colour_pairs:
# Plot each data grouped by label
X_clust = X[labels_true == label_value]
ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax.set_title("True Class Values")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend();
Note that the orderings (0, 1, 2) will differ between found clusters and true classes.
Compare the found clusters and true class values, where is the clustering getting the groupings wrong?
3.3 Exercise 2
Using the data imported below, visualise X, with a scatter plot.
Without looking at the true label values perform K-means clustering on the data with \(K=2\).
Visualise the results of the clustering labels assigned as shown in previous examples. Change the value of \(K\) to a more appropriate value and compare the results.
# Import and separate data
unknown_clusters_data = np.loadtxt("../../data/exercise_one_clusters.csv", delimiter=",")
X, _true_labels= unknown_clusters_data[:,0:-1], unknown_clusters_data[:,-1]
X.shape
plt.scatter(x=X[:,0], y=X[:,1]);
#Perform the clustering, produce group labels
#Create and fit the clustering
kmeans_2_clusters = KMeans(n_clusters=2).fit(X)
#Access the found labels
found_clusters = kmeans_2_clusters.labels_
#Visualise your clustering results
plt.title("K=2 Found Labels")
plt.scatter(x=X[:,0], y=X[:,1], c=found_clusters);
#Find a better value of K
#Repeat the code above with a different value of n_clusters and visualise the results
K = 5
kmeans_5_clusters = KMeans(n_clusters=K).fit(X)
found_clusters = kmeans_5_clusters.labels_
plt.scatter(x=X[:,0], y=X[:,1], c=found_clusters);
Without a way to evaluate the performance of a clustering, we can only really visualise and compare groupings using our intuition. Using just X and KMeans clustering there appear to be five distinct clusters in the data set.
4 Measuring Performance
As with supervised learning it is important to have a way to properly quantify how well a model / algorithm has performed. This is so we can compare different models.
There are broadly two different ways to measure the performance of a clustering algorithm.
The first is a set of methods that compare the output cluster memberships with true class memberships. You can think of this in a similar way to how we compare predictions and true values in supervised learning, except that we do not compare labels directly, but instead ask whether the clusters partition the data similarly to the true classes. These methods are called extrinsic, or external.
The second is to look at properties of the clusters themselves, evaluating the similarity of data placed within the same cluster. These methods are called intrinsic or internal.
The extrinsic methods are only possible where ground truth data exists. For this reason we cannot always use these types of methods, but can always use intrinsic methods.
It is important to note that if we have the true class values of a data set and wanted to predict new instances, we would typically use supervised learning rather than clustering. However, in some cases we may have a subset of true values and are trying to compare models on a baseline.
Labelling data is expensive, if we always had labelled data we wouldn’t need to predict anything! So often we will use some labelled data to help evaluate our clustering performance.
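As a brief preview of the intrinsic approach, sklearn's silhouette score evaluates a clustering using only the data and the found labels; no ground truth is needed. This is a sketch on synthetic data (the silhouette score is one common intrinsic measure among several; the data and parameters here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: true labels exist but are deliberately not used below
X_blobs, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs).labels_
# Silhouette compares within-cluster distances to the nearest other cluster;
# scores lie in [-1, 1], higher meaning tighter, better-separated clusters
print(round(silhouette_score(X_blobs, labels), 4))
```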
4.1 Known Class Membership - Extrinsic
There are a number of extrinsic evaluation methods; we will be exploring the Adjusted Rand Index.
4.1.1 Adjusted Rand Index Properties
The Adjusted Rand Index (ARI) takes as an input the found cluster membership and true class values.
The score is bounded between [-1, 1]. A good ARI, indicating similarity between clusters and classes, is near 1. A bad ARI, indicating no similarity between clusters and classes, is negative.
A random assignment of clusters compared to the true classes will give an ARI near 0. This is important as it allows us to compare our clustering to a baseline.
We do not need to know anything about the structure of the clusters or classes to perform this measure.
The important information about the ARI is discussed above; for more detail on how it is calculated, see below.
4.1.2 ARI formula
We first start with the Rand Index \(RI\), then adjust it.
We have \(n\) data samples.
Each sample has an associated cluster, in the set of clusters \(K\), and a true class within the set \(T\).
Consider the set of all pairs of samples, representing all possible combinations of data points. The number of such pairs is given by \(C_{2}^{n}\) (\(n\) choose 2, where order does not matter).
The number of combinations of data points where the pairs are within the same cluster (in \(K\)) and the same true class (in \(T\)) is \(\alpha\).
The number of combinations of data points where the pairs are not in the same cluster (in \(K\)) nor in the same true class (in \(T\)) is \(\beta\).
The \(RI\) is therefore given by:
\[ RI = \frac{\alpha + \beta}{C_{2}^{n}} \]
which is bounded by \([0, 1]\).
This, however, can give misleading scores: with randomly selected cluster membership the \(RI\) will not necessarily be near zero.
If we produce a random clustering, some pairs will still be correct for \(\alpha\) (same cluster, same true class) and for \(\beta\) (different cluster, different true class). This means that by random chance \(\alpha + \beta > 0\), and therefore \(RI > 0\).
Therefore we take into account the expected random \(RI\), (\(E[RI]\)), and the maximum possible \(RI\) to normalize our value.
The Expected Random RI represents the average RI given all permutations of clustering labels. This helps give a more interpretable score, with random assignment being near zero.
\[ ARI = \frac{RI - E[RI]}{max(RI) - E[RI]} \]
which is now bounded by \([-1, 1]\).
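To see that cluster names carry no meaning in the ARI (only the groupings themselves do), consider a small hand-made example; the label lists here are illustrative.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# The same grouping under permuted label names still scores a perfect 1.0
print(adjusted_rand_score(true_labels, [2, 2, 0, 0, 1, 1]))  # 1.0

# Splitting one true class across two clusters lowers the score
print(adjusted_rand_score(true_labels, [0, 0, 1, 1, 2, 3]))
```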
4.2 ARI Implementation
Just like all other things we have discussed in this course, there is of course a function provided by sklearn to find the ARI.
from sklearn.metrics import adjusted_rand_score
# Scale our X features for ease of clustering
X = StandardScaler().fit_transform(generic_data[:,0:-1])
# Access the true labels from the raw data
labels_true = generic_data[:,-1]
# Compare extrinsically the true and found groupings
k_3_score = adjusted_rand_score(labels_true, labels_found_3)
print("K=3 Adjusted Rand Index Score: ", round(k_3_score, 4))
K=3 Adjusted Rand Index Score:  0.9702
We can see that with an ARI of >0.95 the clustering with \(k=3\) performs extremely well (how convenient!).
This can be compared to a random selection of cluster membership.
from random import choices
# Generate random labels from possible labels
labels_random_list = choices(np.unique(labels_true).astype(int), k=len(labels_true))
# Convert to numpy array for consistency and plotting
labels_random = np.array(labels_random_list)
fig, ax = plt.subplots(figsize=(5,5))
label_colour_pairs = [(0,'green'), (1,'blue'), (2,'orange')]
for label_value, label_colour in label_colour_pairs:
X_clust = X[labels_random == label_value]
ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax.set_title("Random Label Values")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$");
random_score = adjusted_rand_score(labels_true, labels_random)
print("Random Label Adjusted Rand Index:", round(random_score, 4))
Random Label Adjusted Rand Index: -0.0029
This result shows that our ARI will be near \(0\) when a random sample of cluster membership is chosen.
We can see below what the difference would be with a different value of \(K\).
This produces a result better than random, but worse than \(K=3\).
# Perform another clustering on the data with a different K
k = 2
# Selecting basic parameters "random" and max_iter=1 to demonstrate simplified algorithm
km_clusterer = KMeans(n_clusters=k,
init="random",
max_iter=1,
random_state=123).fit(X)
# Generate labels from clustering
labels_found_2 = km_clusterer.labels_
label_colour_pairs = [(0,'lightcoral'), (1,'darkslateblue')]
fig, ax = plt.subplots(figsize=(5,5))
# Iterate over the different label values and corresponding colours
for label_value, label_colour in label_colour_pairs:
X_clust = X[labels_found_2 == label_value]
ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax.set_title("Found Cluster Labels $K=2$")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$");
k_2_score = adjusted_rand_score(labels_true, labels_found_2)
print("K=2 Adjusted Rand Index Score:", round(k_2_score, 4))
K=2 Adjusted Rand Index Score: 0.5299
Other extrinsic methods (use true values) for measuring cluster performance include:
- Mutual Information (normalised, adjusted)
- Homogeneity, Completeness, V-measure (their harmonic mean)
- Fowlkes-Mallows Index
These have been excluded for brevity, but are worth exploring in your own time. Each has benefits and drawbacks just like any other evaluation metric discussed. A basic description and implementation can be found in the sklearn documentation.
4.3 Exercise 3
Using the data imported below and the ARI work out what the best value for \(K\) is. Plot the results of the best \(K\).
Using the true labels work out how many actual groups in the data there are, is this the same as what you found? Plot the true labels and compare results.
#Import and separate data
unknown_clusters_data = np.loadtxt("../../data/exercise_two_clusters.csv", delimiter=",")
X, true_labels= unknown_clusters_data[:,0:-1], unknown_clusters_data[:,-1]
#Find the best value for K using ARI
#Create list to store value of K and associate ARI score
ARI_scores = []
#Generate values of K to be looped through
K_values = range(1, 20)
#Loop though values of K, creating new model and evaluating for each
for K in K_values:
KMeans_model = KMeans(n_clusters=K).fit(X)
found_clusters = KMeans_model.labels_
# Calculate evaluation score
ARI_score = adjusted_rand_score(true_labels, found_clusters)
# Store K and ARI
ARI_scores.append((K, ARI_score))
#Find the highest value ARI score and associated K
best_K, best_score = max(ARI_scores, key=lambda x:x[1])
print("Best K value:", best_K)
print("ARI of best K:", round(best_score, 4))
#Plot the found labels for your value of K
best_KMeans_model = KMeans(n_clusters=best_K).fit(X)
best_found_clusters = best_KMeans_model.labels_
plt.scatter(x=X[:,0], y=X[:,1], c=best_found_clusters);
#This clustering looks very reasonable, with each group compact and separate.
#It would also be useful to look at the different ARI scores of each value of $K$.
#Calculate how many true labels there are
unique_true_labels = np.unique(true_labels)
#Given the unique labels, how many labels is the length of the array
print("Number of true classes:", len(unique_true_labels))
fig, axes = plt.subplots(figsize=(12,5), ncols=2, nrows=1, sharey=True)
axes[0].set_title("True Labels")
axes[0].scatter(x=X[:,0], y=X[:,1], c=true_labels, cmap="magma")
axes[1].set_title("Best Found Labels")
axes[1].scatter(x=X[:,0], y=X[:,1], c=best_found_clusters, cmap="viridis");
warnings.warn(
C:\Users\tbalb\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:1446: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
warnings.warn(
C:\Users\tbalb\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:1446: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
warnings.warn(
Best K value: 8
ARI of best K: 0.924
Number of true classes: 8
We can see two main things from this comparison.
Firstly, it shows why we have such a high ARI value: our clustering gets most of the groupings correct. In each cluster there are some errors, with mis-grouped points around the edges; this is natural.
Secondly, we can see why we have found 7 clusters instead of 8. In the True Labels figure there are two clusters that occupy a significant amount of common space. Without knowing the true labels it is very difficult for an algorithm to differentiate between these two clusters. Some methods are better at this than others.
The effect this may have on our analysis is underestimating the number of groups in our data set. More importantly, as a result we may lose two unique groups in our analysis, and instead have one larger group.
For example if we were clustering people’s exercise habits, we could end up finding the group “people who play racket sports” rather than “people who play tennis” and “people who play table tennis”.
Without the true label values, it would be challenging to work this out. This is where domain knowledge and a thorough understanding of the data come into play.
4.4 Unknown Class Membership - Intrinsic
For intrinsic evaluation we no longer have the true labels. This means we need to generate meaningful quantities from the data and clusters generated. We can do this using the distances between different positions within the data space.
4.4.1 Silhouette Score Properties
The silhouette coefficient is a value that can be calculated for each data point in the set. Averaging the coefficients over all data points gives an overall indication of how well the clustering has performed.
The resulting silhouette score is bounded in \([-1, 1]\), with \(+1\) corresponding to the “best” and \(-1\) the “worst”. A value around \(0\) indicates a high amount of overlap between clusters (we may not see this frequently in KMeans).
Values that are produced are therefore high when our clusters are well separated, and low when not.
We can only calculate the silhouette score on data where \[2 \leq k \leq n - 1\] with \(k\) as the number of labels and \(n\) the number of data points.
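scikit-learn enforces this constraint: if the number of labels falls outside \([2, n-1]\) the score cannot be computed and a `ValueError` is raised. A quick sketch on toy data (the values here are illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data for illustration: n = 4 points in 1D
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Valid: two labels satisfies 2 <= k <= n - 1
score = silhouette_score(X, np.array([0, 0, 1, 1]))
print(round(score, 3))

# Invalid: a single label means k < 2, which raises a ValueError
try:
    silhouette_score(X, np.array([0, 0, 0, 0]))
except ValueError as err:
    print("ValueError:", err)
```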
4.4.2 Silhouette Score Formula
We first look at calculating a single silhouette coefficient for one data point.
\(a\) is the mean intra-cluster distance. This is the mean of the distances between the data point and every other point in its own cluster.
\(b\) is the mean nearest-cluster distance. This is the mean of the distances between the data point and every point in the nearest cluster of which it is not a member.
The coefficient is therefore given by:
\[ SC = \frac{b - a}{\max(a, b)}\]
We then average the coefficients over all data points to produce the overall silhouette score.
\[SS = \frac{1}{n}\sum^{n}_{i=1}\frac{b_i - a_i}{\max(a_i, b_i)}\]
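As a check on these definitions, the sketch below computes the coefficient per point directly from the \(a\)/\(b\) definitions above and compares it with scikit-learn's per-point implementation, `silhouette_samples` (the toy data is purely illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data for illustration: two small clusters in 1D
X = np.array([[0.0], [0.2], [0.4], [4.0], [4.2]])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_coefficient(i, X, labels):
    """Silhouette coefficient for point i, following the a/b definitions above."""
    own = labels[i]
    # a: mean distance from point i to the other points in its own cluster
    same = [j for j in range(len(X)) if labels[j] == own and j != i]
    a = np.mean([np.linalg.norm(X[i] - X[j]) for j in same])
    # b: smallest mean distance from point i to the points of any other cluster
    b = min(np.mean([np.linalg.norm(X[i] - X[j]) for j in np.where(labels == other)[0]])
            for other in np.unique(labels) if other != own)
    return (b - a) / max(a, b)

manual = np.array([silhouette_coefficient(i, X, labels) for i in range(len(X))])
# Cross-check against scikit-learn's per-point implementation
print(np.allclose(manual, silhouette_samples(X, labels)))  # True
```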
4.4.3 Silhouette Score Implementation
We will now look at how to use the Silhouette Score using the \(K=3\) model fitted earlier. We can specify which metric we want to use when calculating the distances depending on what we think might be most appropriate.
The metrics we can use are given by scikit-learn pairwise distances.
Below we are going to calculate baseline silhouette scores. These are created using the true labels: if the true labels give low silhouette scores, we shouldn't expect the clustered scores to be high, meaning the data may be inherently difficult to cluster.
If our clustering gives silhouette scores comparable to the true-label scores, then the groupings are likely good!
from sklearn.metrics import silhouette_score
# Prepare raw data for clustering
X = StandardScaler().fit_transform(generic_data[:,0:-1])
# Produce baseline true value silhouette scores
sil_score_eucl = silhouette_score(X, labels_true, metric="euclidean")
sil_score_manhat = silhouette_score(X, labels_true, metric="manhattan")
print("Euclidean Silhouette Score: ", round(sil_score_eucl, 4))
print("Manhattan Silhouette Score: ", round(sil_score_manhat, 4))
Euclidean Silhouette Score: 0.6219
Manhattan Silhouette Score: 0.5956
Note how each different metric gives us a different score.
We can see how the Silhouette Score changes with \(K\) by training and evaluating a number of models with different \(K\).
# We will iterate over different values of K and produce a model for each value of K
# Choosing a euclidean distance metric for evaluation
k_values = list(range(2, 40))
sil_values = []
for K in k_values:
    # Perform the clustering for a given value of K
    km_clusterer = KMeans(n_clusters=K, init="random",
                          max_iter=5, random_state=123).fit(X)
    labels_pred = km_clusterer.labels_
    # Calculate the silhouette score for the associated k clusters
    eucl_sil_score = silhouette_score(X, labels_pred, metric="euclidean")
    # Add the score to the list of scores
    sil_values.append(eucl_sil_score)
fig, ax = plt.subplots(figsize=(6,4))
plt.plot(k_values, sil_values, c="darkgreen")
ax.set_title("Silhouette Score for different K")
ax.set_xlabel("K")
ax.set_ylabel("Euclidean Silhouette Score");
This plot shows that there is a clear best value of \(K\) in this data set, shown by the highest Euclidean silhouette score at \(K = 3\).
Even though our data quite clearly forms three clusters, the score in this case is only ~0.65 because the clusters are not widely separated from each other. This shows that clusters that are close together will score more poorly than those far apart.
Evaluating the performance over a range of \(K\) values like this is a useful way of finding an optimal \(K\). The silhouette score tells us which number of clusters best separates the clusters relative to their internal spread. (This is in effect a manual grid search we can visualise!)
Even though this can be powerful, it is important that we look at the resulting clustering and determine for ourselves whether a value of \(K\) is good.
Other intrinsic methods for cluster evaluation include:
- Variance Ratio Criterion (Calinski-Harabasz Index)
- Davies-Bouldin Index (in this measure low is good!)
As with all evaluation, you should consider what it is you are trying to evaluate (what is important to your model / research / problem) then select a method that aligns with this and will help you maximise this quantity.
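Both alternatives listed above are available in `sklearn.metrics`. A minimal sketch of how they behave over different \(K\) (the blob data here is generated purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Illustrative stand-in data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

ch_scores, db_scores = {}, {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch_scores[k] = calinski_harabasz_score(X, labels)  # higher is better
    db_scores[k] = davies_bouldin_score(X, labels)     # lower is better
    print(f"K={k}: CH={ch_scores[k]:.1f}, DB={db_scores[k]:.3f}")
```

With well-separated blobs both measures agree: the Calinski-Harabasz index peaks, and the Davies-Bouldin index bottoms out, at the true \(K=3\).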
4.5 Exercise 4
Using the data imported below, perform KMeans clustering with \(K=5\) and calculate the silhouette score using both Manhattan and Euclidean distances.
#Load the data
unknown_clusters_data = np.loadtxt("../../data/exercise_three_clusters.csv", delimiter=",")
X, _true_labels = unknown_clusters_data[:,0:-1], unknown_clusters_data[:,-1]
#Perform Kmeans clustering on the data.
KMeans_model_5 = KMeans(n_clusters=5).fit(X)
#Generate labels and silhouette score using both metrics
found_clusters = KMeans_model_5.labels_
#Score using each metric
sil_score_euc = silhouette_score(X=X, labels=found_clusters, metric="euclidean")
sil_score_man = silhouette_score(X=X, labels=found_clusters, metric="manhattan")
print("Euclidean Silhouette Score:", round(sil_score_euc, 4))
print("Manhattan Silhouette Score:", round(sil_score_man, 4))
Euclidean Silhouette Score: 0.5776
Manhattan Silhouette Score: 0.5877
Whilst we do not select a distance metric when performing the KMeans clustering itself, we can do so when evaluating our clustering intrinsically. In this case there is not a significant difference between the Manhattan and Euclidean scores, but with more complex data this may not always be true. As with all modelling packages, it is important to check what the default parameters are before relying on the results!
5 Density Based Clustering
KMeans clustering is merely one of a range of methods that can be used to create groupings. It is a powerful method, but it does have some inherent problems and areas where it does not perform well.
KMeans clustering excels where there is a clear separation between classes when taking into account the distance from the center. You can think of this as clustering “as the crow flies”.
This is not a good property for some data, for example the two dimensional data set below.
# Load dataset
moon_data = np.loadtxt("../../data/halfmoons.csv", delimiter=",")
# Get the first two columns of the data and scale it
X_moon = StandardScaler().fit_transform(moon_data[:,0:2])
moon_true_labels = moon_data[:,-1]
label_colour_pairs = [(0,'seagreen'), (1,'purple')]
fig, ax = plt.subplots(figsize=(5,5))
# Specify a certain colour for each class
for label_value, label_colour in label_colour_pairs:
    X_clust = X_moon[moon_true_labels == label_value]
    ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
               c=label_colour, label=str(label_value))
ax.set_title("True Class Values")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend();
We can see that this data is not as simply grouped as previous examples, but why? This is because the center of each cluster is actually very near data of a different class. K-means does not perform well even with the correct true \(K\) value, as shown below.
k = 2
# Train naive model
km_clusterer_moon = KMeans(n_clusters=k,
                           init="random",
                           max_iter=1,
                           random_state=123).fit(X_moon)
km_moon_labels = km_clusterer_moon.labels_
label_colour_pairs = [(0,'magenta'), (1,'lightgreen')]
# Plot found clusters
fig, ax = plt.subplots(figsize=(5,5))
for label_value, label_colour in label_colour_pairs:
    X_clust = X_moon[km_moon_labels == label_value]
    ax.scatter(x=X_clust[:,0], y=X_clust[:,1],
               c=label_colour, label=str(label_value))
ax.set_title("K-Means clusters with $K=2$")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend();
The KMeans method is unable to consider local structure; it assumes that all clusters are convex, and that the calculated centroid is a meaningful summary of the cluster.
Instead of measuring distances and creating centroids, a density based method takes into account how many data points are within the surrounding area of a “core sample”. A core sample is a sample that is within an area of high density.
Starting from core samples, the algorithm calculates distances to nearby points and repeatedly absorbs neighbours into the cluster, spreading through the whole data set much like a tree search.
The result of this approach is that density based methods can cluster in interesting shapes - so long as there is a dense enough link from points.
The specific density based clustering algorithm we will be looking at is Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a commonly used method.
5.1 DBSCAN Properties
Instead of assuming clusters are convex, DBSCAN assumes that clusters are regions of high density (many points in a small area), separated by regions containing fewer data points.
Significantly, when compared to K-Means, the number of clusters does not need to be specified; DBSCAN will find this for us.
Other than which distance metric to use, there are two characteristic values used by the DBSCAN algorithm.
The first is the “reach distance”: the radius around a point from which the density is calculated. As with the rest of our methods using distances, we want to be sure we are scaling and centering the data.
The second is the number of points within our reach distance required to count an area as high density (this count includes the point in question).
These two quantities, the number of points and surrounding area, allow the density around each point to be calculated. If the density is above the required amount then the new point is added to the cluster.
\[density = \frac{count}{space}\]
A higher number of points or a lower reach distance requires the data to be denser before a point is added to the cluster.
Some points are within the required distance of a core sample, but their own neighbourhoods are not dense enough for them to be core samples; these non-core samples are still part of the cluster, but on its fringes.
Data points a distance further than the reach distance from a core sample are considered outliers by DBSCAN and not given a cluster (the label -1 is given to outliers).
DBSCAN is not inherently stochastic, but label membership can depend on the order of core samples chosen.
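The three kinds of point DBSCAN distinguishes (core, non-core, outlier) can be inspected directly via scikit-learn's `core_sample_indices_` attribute. A small sketch on hypothetical one-dimensional data, with `eps` and `min_samples` chosen purely to illustrate all three cases:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1D data: a dense run of points plus one isolated point
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [5.0]])

# min_samples=3 counts the point itself; eps is the reach distance
db = DBSCAN(eps=0.15, min_samples=3).fit(X)

# Mark which points qualified as core samples
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

for x, label, is_core in zip(X.ravel(), db.labels_, core_mask):
    kind = "core" if is_core else ("outlier" if label == -1 else "non-core")
    print(f"x={x:.1f}  cluster={label:2d}  {kind}")
```

The two end points of the dense run (0.0 and 0.4) have only two neighbours within `eps`, so they are non-core members of the cluster, while 5.0 is too far from any core sample and is labelled -1.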
5.2 Implementing DBSCAN
Within sklearn the following conventions are used:
- The “reach distance” is called eps (epsilon, \(\epsilon\))
- The number of points required is called min_samples
As these two quantities jointly set the required density, we can often hold one fixed and tune just the other, such as eps.
Changing min_samples can help impact the effect of noisy data.
from sklearn.cluster import DBSCAN
# Define a value for eps
reach_dist = 0.2
# Run the clustering algorithm on data
dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X_moon)
db_moon_labels = dbscan_clusterer.labels_
label_colour_pairs = [(-1, "black"), (0,'orchid'), (1,'lightseagreen'), (2, "slateblue"), (3, "yellow")]
# Plot the found clusters and unlabelled data
fig, ax = plt.subplots(figsize=(5,5))
for label_value, label_colour in label_colour_pairs:
    X_clust = X_moon[db_moon_labels == label_value]
    ax.scatter(x=X_clust[:,0], y=X_clust[:,1], c=label_colour, label=str(label_value))
ax.set_title("DBSCAN with eps={}".format(reach_dist))
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.legend();
A value of -1 corresponds to “No cluster”. This means the point has not been allocated to any cluster by the time the clustering algorithm finishes: not all points are near enough to others to be counted. This property can be used to detect outliers in data.
We can see from this that DBSCAN is not actually performing as well as we would like: it finds three clusters when intuitively we can see two. There are also some outliers (black) which should be grouped into a main cluster.
By iterating over the eps hyperparameter using a scoring function we can see what a better approximate value for eps might be. See the further reading section for more information on this plot.
We are going to use the silhouette score to measure our clusters performance, then compare the best eps with our original clustering.
# Generate a sequence of possible eps values in a range
possible_eps_values = np.linspace(start=0.01, stop=0.5, num=50)
possible_eps_values
array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ])
# We will iterate over different values of eps and produce a model for each iteration
eps_values = []
dbscan_sil_values = []
# Generate a range of eps to try
for reach_dist in possible_eps_values:
    # Create labels for each reach_dist
    dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X_moon)
    labels_pred = dbscan_clusterer.labels_
    # We can only use the silhouette score when there is more than one label
    if len(np.unique(labels_pred)) > 1:
        eucl_sil_score = silhouette_score(X_moon, labels_pred, metric="euclidean")
        # Add the eps and score used to a list for plotting
        eps_values.append(reach_dist)
        dbscan_sil_values.append(eucl_sil_score)
fig, ax = plt.subplots()
plt.plot(eps_values, dbscan_sil_values, c="blueviolet")
ax.set_title("Silhouette Score for different $eps$")
ax.set_xlabel("$eps$")
ax.set_ylabel("Euclidean Silhouette Score");
This plot shows us that the lower values of eps perform poorly with this data, with more reasonable scores for eps greater than 0.15. In addition, the scores plateau around 0.3.
Below we visualise a range of different eps values to show how the clustering changes with this parameter.
reach_distances = [0.18, 0.2, 0.3, 0.4]
label_colour_pairs = [(-1, "black"), (0,'orchid'), (1,'lightseagreen'), (2, "slateblue"), (3, "yellow")]
fig, ax = plt.subplots(figsize=(16,4), nrows=1, ncols=4, sharey=True)
for axes_index, reach_dist in enumerate(reach_distances):
    # Fit the clusterer on the data
    dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X_moon)
    # Generate the labels
    db_moon_labels = dbscan_clusterer.labels_
    # Plot each label for each clusterer
    for label_value, label_colour in label_colour_pairs:
        X_clust = X_moon[db_moon_labels == label_value]
        ax[axes_index].scatter(x=X_clust[:,0], y=X_clust[:,1],
                               c=label_colour, label=str(label_value))
    ax[axes_index].set_title("$eps={}$".format(reach_dist))
    ax[axes_index].set_xlabel("$x_1$")
    ax[axes_index].set_ylabel("$x_2$")
    # Count the number of unlabelled points and total classes
    number_unlabelled = np.count_nonzero(db_moon_labels == -1)
    number_unique_labels = len(np.unique(db_moon_labels[db_moon_labels != -1]))
    ax[axes_index].text(s="Classes: {}".format(number_unique_labels),
                        x=-2., y=-1.9, fontsize=10)
    ax[axes_index].text(s="Unlabelled points: {}".format(number_unlabelled),
                        x=-2., y=-2.15, fontsize=10)
If we compare our eps=0.2 and eps=0.3 cluster labels, you can see that eps=0.3 produces two classes, which more naturally fits the data. In addition, there are no unclassed data points (denoted in black).
By finding a more appropriate eps value we can group the data into more coherent groups.
As the reach distance increases, our allocations become more resistant to noise, but we are also more likely to absorb lower-density areas that are potentially a different cluster.
At a certain size of reach distance we end up labelling all our data in the same class, which is not useful.
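The min_samples hyperparameter pulls in the opposite direction: holding eps fixed, raising min_samples demands denser neighbourhoods to qualify as core samples, so fringe points increasingly become outliers. A minimal sketch of this sweep, using make_moons as a generated stand-in for the half-moon data file used above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Stand-in half-moon data (generated here rather than loaded from the course file)
X_demo, _ = make_moons(n_samples=400, noise=0.05, random_state=123)
X_demo = StandardScaler().fit_transform(X_demo)

# Fix eps and vary min_samples: fewer points qualify as core samples,
# so the set of outliers (-1 labels) can only grow
outlier_counts = []
for min_samples in (3, 10, 30):
    labels = DBSCAN(eps=0.2, min_samples=min_samples).fit(X_demo).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = int(np.count_nonzero(labels == -1))
    outlier_counts.append(n_outliers)
    print(f"min_samples={min_samples:2d}: {n_clusters} clusters, {n_outliers} outliers")
```

Because the set of core samples shrinks monotonically as min_samples grows, the outlier count is non-decreasing across the sweep.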
As we have seen, DBSCAN can be a powerful method for finding irregularly shaped clusters. When using this density-based method we make a different assumption about our data/labels than for KMeans.
5.3 Exercise 5
Using the data loaded below with eps=0.15, calculate the silhouette score of the DBSCAN clustering, using the euclidean distance for both DBSCAN and the silhouette score. Visualise the found clusters.
#Load the data
swirl_data = np.loadtxt("../../data/exercise_four_clusters.csv", delimiter=",")
X, _true_labels= swirl_data[:,0:-1], swirl_data[:,-1]
#Define the value for eps
reach_dist = 0.15
#Create and fit the model
dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X)
found_clusters = dbscan_clusterer.labels_
#Score the found labels
sil_score_euc = silhouette_score(X=X, labels=found_clusters, metric="euclidean")
print("eps={} silhouette score: {:.4f}".format(reach_dist, sil_score_euc))
#Visualise the found clusters
plt.scatter(x=X[:,0], y=X[:,1], c=found_clusters);
eps=0.15 silhouette score: 0.1871
5.3.1 DBSCAN Challenges
When using a density-based approach we expect our clusters to have a similar density to each other. This helps us to tune appropriate eps/min_samples values. If this is not true, it can often be a challenge to find appropriate parameter values for the algorithm, and the result can be poor clustering labels.
In the example below we look at a similar set of data to the one used to explore K-means clustering, with three distinct circular clusters. However, this data is slightly different:
- The data in each cluster is “noisier”, each grouping has a larger standard deviation.
- One of the clusters, given by the green colour, is less dense. It has 70% of the number of data points of the other labels.
We are first going to look at the clustering produced by K-means then compare that with DBSCAN.
# Load the data
density_data = np.loadtxt("../../data/clusters_density.csv", delimiter=",")
density_data.shape
(450, 3)
# Process the data
X_density = StandardScaler().fit_transform(density_data[:,0:-1])
labels_true = density_data[:,-1]
fig, ax = plt.subplots(figsize=(10,5), nrows=1, ncols=2, sharey=True)
label_colour_pairs_true = [(0,'royalblue'), (1,'goldenrod'), (2,'seagreen')]
# Plot true values
for label_value, label_colour in label_colour_pairs_true:
X_clust = X_density[labels_true == label_value]
ax[0].scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax[0].set_title("True Class Values")
ax[0].set_xlabel("$x_1$")
ax[0].set_ylabel("$x_2$")
ax[0].legend()
# Generate cluster labels using K-means
k = 3
# Use naive implementation for consistency
km_clusterer = KMeans(n_clusters=k,
init="random",
max_iter=1,
random_state=123).fit(X_density)
labels_pred = km_clusterer.labels_
# Calculate silhouette score of found and true clusters
kmeans_density_score = silhouette_score(X_density, labels_pred)
true_score = silhouette_score(X_density, labels_true)
print("Silhouette Score for K=3:", kmeans_density_score.round(3))
print("Silhouette Score for True labels:", true_score.round(3))
label_colour_pairs_found = [(0,"blue"), (1,"orange"), (2,"green")]
# Plot found clusters
for label_value, label_colour in label_colour_pairs_found:
X_clust = X_density[labels_pred == label_value]
ax[1].scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax[1].set_title("Found Cluster Labels $K=3$")
ax[1].set_xlabel("$x_1$")
ax[1].set_ylabel("$x_2$")
ax[1].legend();
Silhouette Score for K=3: 0.502
Silhouette Score for True labels: 0.492
The K-means clustering looks like it is performing well in this case. Our found clusters are actually more compact than the true values!
If we perform the same process for looking at the effect of eps on our clustering metric we can get silhouette scores for DBSCAN. This will help us pick a good value for eps.
If we iterate over a range of eps, not all values will give us a clustering with more than one label. Some reach distances will clump all data points into the same group, in which case we cannot produce a silhouette score.
# Lists to store plotting values
eps_values = []
dbscan_sil_values = []
# Generate a range of eps to try
for reach_dist in np.linspace(start=0.1, stop=1, num=50):
# Create labels for each reach_dist
dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X_density)
labels_pred = dbscan_clusterer.labels_
# We can only use the silhouette score when there are >1 labels
if len(np.unique(labels_pred)) > 1:
eucl_sil_score = silhouette_score(X_density, labels_pred, metric="euclidean")
# Add the eps and score used to a list for plotting
eps_values.append(reach_dist)
dbscan_sil_values.append(eucl_sil_score)
fig, ax = plt.subplots(figsize=(6,4))
plt.plot(eps_values, dbscan_sil_values, c="blueviolet")
ax.set_title("DBSCAN Silhouette Scores for different $eps$")
ax.set_xlabel("$eps$")
ax.set_ylabel("Euclidean Silhouette Score");
From our plot above we can see that we get better scores as we increase eps up to a point, with the best values at approximately \(eps > 0.37\).
But what we have done here is find a parameter based on some metric, we haven’t actually checked what our clusters look like, or if they are grouping the data in an intuitive way.
We saw that there are three main clusters in the data originally; does DBSCAN reproduce this?
reach_distances = [0.24, 0.26, 0.3, 0.6]
fig, ax = plt.subplots(figsize=(16,4), nrows=1, ncols=4, sharey=True)
label_colour_pairs = [(-1, "black"), (0,'orchid'),
(1,'lightseagreen'), (2, "slateblue"),
(3, "yellow"), (4, "cornflowerblue")]
# Produce a plot for each value of eps / reach distance
for axes_index, reach_dist in enumerate(reach_distances):
# Fit the clusterer on the data
dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X_density)
# Generate the labels
labels_pred = dbscan_clusterer.labels_
# Plot each label value for each clusterer
for label_value, label_colour in label_colour_pairs:
X_clust = X_density[labels_pred == label_value]
ax[axes_index].scatter(x=X_clust[:,0], y=X_clust[:,1],
c=label_colour, label=str(label_value))
ax[axes_index].set_title("$eps={}$".format(reach_dist))
ax[axes_index].set_xlabel("$x_1$")
ax[axes_index].set_ylabel("$x_2$")
# Count the number unlabelled points and total classes
number_unlabelled = np.count_nonzero(labels_pred == -1)
number_unique_labels = len(np.unique(labels_pred[labels_pred != -1]))
ax[axes_index].text(s="Classes: {}".format(number_unique_labels),
x=-3., y=1.9, fontsize=10)
ax[axes_index].text(s="Unlabelled points: {}".format(number_unlabelled),
x=-3., y=2.25, fontsize=9)
Feel free to change the different values of eps shown within reach_distances. What you should be able to see is that our clustering is very sensitive to the value of eps, and none of our options produce the 3 clusters we were expecting.
As a result, we would conclude that with this data set DBSCAN is probably not an appropriate choice of method.
We could also see this by comparing the silhouette scores produced by different methods. For K-means with \(K=3\) we get \(s \approx 0.5\) compared to the maximum from DBSCAN of \(s \approx 0.4\).
As a result, we should always consider multiple clustering methods for our data, as the structure of the data will lend itself to some approaches more than others. By measuring the different methods performances using the same metrics we can then compare methods.
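As a rough sketch of this comparison workflow (the data here is generated with make_blobs rather than loaded from the course files, so the exact scores will differ):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Generate compact, well-separated toy clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit two different clustering methods on the same data
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X).labels_
db_labels = DBSCAN(eps=0.3).fit(X).labels_

# Score both methods with the same intrinsic metric so they are comparable
scores = {"kmeans": silhouette_score(X, km_labels)}
if len(np.unique(db_labels)) > 1:
    scores["dbscan"] = silhouette_score(X, db_labels)
print(scores)
```

Because both methods are scored with the same metric on the same data, the higher score suggests which method's assumptions better match this data's structure.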
5.4 Exercise 6
Using the data loaded below, visualise the two true classes in the data set. Why might these clusters be challenging to model?
Select a distance metric to use, then find an appropriate value of eps to use with DBSCAN, using an appropriate evaluation method. Why did you select this distance metric? Visualise the best output found.
HINT: Check how many dimensions are in the data.
#Load the data
further_swirl_data = np.loadtxt("../../data/exercise_five_clusters.csv", delimiter=",")
X, true_labels= further_swirl_data[:,0:-1], further_swirl_data[:,-1]
X.shape
#Visualise the data
fig, axes = plt.subplots(figsize=(12,5), ncols=2)
axes[0].scatter(x=X[:,0], y=X[:,1], c=true_labels, cmap="coolwarm")
axes[1].scatter(x=X[:,0], y=X[:,2], c=true_labels, cmap="coolwarm");
#First explore the sensitivity to eps the data has and visualise the results.
#We can optimise on some evaluation score shortly
#Feel free to change reach_dist to explore impact of value
reach_dist = 0.25
dbscan_clusterer = DBSCAN(eps=reach_dist, metric="euclidean").fit(X)
found_clusters = dbscan_clusterer.labels_
fig, axes = plt.subplots(figsize=(12,5), ncols=2)
axes[0].scatter(x=X[:,0], y=X[:,1], c=found_clusters, cmap="plasma")
axes[1].scatter(x=X[:,0], y=X[:,2], c=found_clusters, cmap="plasma");
#Count how many unlabelled data points exist in the data set
#Unlabelled data == -1, we can create a mask then count the True's (1's)
print("Number unlabelled", np.count_nonzero(found_clusters == -1))
#Define the range of eps values
eps_values = np.linspace(start=0.01, stop=1, num=200)
#A list to store eps values and associate scores
sil_scores = []
#Specify the metric
distance_metric = "manhattan"
#Loop over all eps values defined
for reach_dist in eps_values:
DBSCAN_model = DBSCAN(eps=reach_dist, metric=distance_metric).fit(X)
found_clusters = DBSCAN_model.labels_
#silhouette score only defined when more than one cluster found
if len(np.unique(found_clusters)) > 1:
sil_score = silhouette_score(X, found_clusters)
sil_scores.append((reach_dist, sil_score))
#Find eps with highest associated silhouette score
highest_eps, highest_value = max(sil_scores, key=lambda x:x[1])
print("Best value of eps:", round(highest_eps, 4))
print("Best eps Silhouette Score:", round(highest_value, 4))
best_DBSCAN = DBSCAN(eps=highest_eps, metric=distance_metric).fit(X)
best_found_clusters = best_DBSCAN.labels_
fig, axes = plt.subplots(figsize=(12,5), ncols=2)
axes[0].scatter(x=X[:,0], y=X[:,1], c=best_found_clusters, cmap="copper")
axes[1].scatter(x=X[:,0], y=X[:,2], c=best_found_clusters, cmap="copper");
print("Number unlabelled", np.count_nonzero(best_found_clusters == -1))
Number unlabelled 33
Best value of eps: 0.5423
Best eps Silhouette Score: 0.1907
Number unlabelled 0
6 Hierarchical Clustering
The third and final class of clustering method we are going to look at is called hierarchical clustering.
It is called hierarchical because the method produces a hierarchy of clusterings.
At the bottom, each data point is treated as an individual cluster; at the top there is one single cluster which contains all data points.
Between these two extremes, at different levels, there are a number of clusters containing different subsets of the data. We will look at this visually shortly.
There are two types of hierarchical clustering available: agglomerative and divisive.
An agglomerative method starts with each point being its own cluster, then combines clusters one by one until we have the desired number of clusters.
Divisive methods start with all the data points being in one cluster, the method then breaks up the clusters until the desired number is found.
In this course we will focus on agglomerative methods.
6.1 Algorithm
Below we will go through the steps taken in an agglomerative clustering of data points.
In our example we have six data points in two dimensions. We can see visually that they form roughly two groups of three data points each.
Input
- \(X\) - training data of \(n\) rows and \(p\) columns (\(n\) data points, each with \(p\) dimensions).
- \(j\) - number of clusters desired, \(1 \leq j \leq n\).
Step 1
Initialise all data points as their own clusters. There will be \(n\) clusters at this step.
Each data point is within a cluster of one single point.
Step 2
Calculate the distance between all the clusters. Merge the two clusters which have the smallest distance between them (most similar). There will now be one fewer cluster in the data set.
We will look at how we determine the distance between clusters in the next section.
Step 3
Repeat step 2 until the number of clusters is equal to the number of desired clusters, \(j\).
If \(j=1\) the algorithm will continue until all points are within the same cluster.
The end result of our clustering, and how well the method performs will depend on the point in the iteration that we stop combining clusters.
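The steps above can be sketched with a naive single-linkage implementation (purely illustrative; library implementations are far more efficient):

```python
import numpy as np

def naive_agglomerative(X, j):
    """Naive agglomerative clustering with single-linkage (Euclidean)."""
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Step 3: repeat merging until only j clusters remain
    while len(clusters) > j:
        best = (None, None, np.inf)
        # Step 2: find the pair of clusters with the smallest linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[p] - X[q])
                        for p in clusters[a] for q in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge the closest pair into one cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two obvious groups of three points each
X = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
              [5., 5.], [5.1, 5.], [5., 5.1]])
print(naive_agglomerative(X, 2))  # [[0, 1, 2], [3, 4, 5]]
```

With \(j=2\), the three points near the origin merge first, then the three points near (5, 5), leaving the two expected groups.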
6.1.1 Number of Clusters
If we stop early, with a high value of \(j\), we will have many clusters, each with data points close to one another, or with data points on their own.
If we use a significantly lower value of \(j\) we run the risk of merging clusters that contain very dissimilar points.
As with KMeans clustering and DBSCAN, the performance of our clustering will be sensitive to the choice of parameter value used within the algorithm.
A significant benefit of this method is that we can explore the model clusters at each step of its iteration, allowing us to look at which number of clusters may be appropriate without having to re-run the process.
6.2 Linkage
Up until this point with hierarchical clustering we have been quite vague about how the distance between clusters is measured.
We have discussed how the distance between points can be calculated, using a distance metric (Euclidean, Manhattan and more…), however, not how we measure the distance between groupings of points.
In order to measure the distance between clusters, we need to select specific points related to each cluster, and calculate the distance using a metric between those.
How the distance between clusters are determined is called a “linkage method”. There are several methods, ranging in complexity that allow us to calculate the distance between clusters.
Each linkage method lets us optimise some quality of the resulting clusters. This is similar to how different intrinsic evaluations of clustering look for different qualities in the clusters produced.
We will briefly look at a few linkage methods before visualising their outputs and applying them in practice.
6.2.1 Single-Linkage and Complete-Linkage
The first linkage method, single-linkage, is intuitive: the distance between two clusters is defined as the distance between the two closest points, one from each cluster.
Single-linkage is computed by taking the distance (with some metric) between all pairs of points in the data set, and combining the clusters which have the smallest distance between two points not in the same cluster.
Suppose we have two clusters, \(A\) and \(B\), where \(a\) and \(b\) are points in \(A\) and \(B\) respectively.
We define the single-linkage function between two clusters as:
\[ D(A,B) = \min_{a \in A,\, b \in B} d(a, b) \]
\(D(A,B)\) therefore gives us the closest distance between a point in A and a point in B.
This method is quite intuitive: join clusters that are "near" to each other. However, this linkage has a tendency to produce long, thin clusters. This may or may not be suitable for your application.
We will look at this method’s performance in comparison with a select few others shortly.
A similarly structured and commonly used method to single-linkage is complete-linkage.
With complete linkage the distance between clusters is determined using the maximum distance between pairs of points in separate clusters. This helps to make the clusters more compact and avoid long trailing groups. We will look at how to implement this method in the next few sections.
We define the complete-linkage function between two clusters as:
\[ D(A,B) = \max_{a \in A,\, b \in B} d(a, b) \]
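Both formulas reduce to taking the minimum or maximum over the matrix of pairwise distances between the two clusters. A small numpy sketch (assuming Euclidean distance):

```python
import numpy as np

def pairwise_distances(A, B):
    # Matrix of Euclidean distances between every point in A and every point in B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):
    # D(A,B) = min over all cross-cluster pairs
    return pairwise_distances(A, B).min()

def complete_linkage(A, B):
    # D(A,B) = max over all cross-cluster pairs
    return pairwise_distances(A, B).max()

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[4., 0.], [6., 0.]])
print(single_linkage(A, B))    # closest pair (1,0)-(4,0): 3.0
print(complete_linkage(A, B))  # farthest pair (0,0)-(6,0): 6.0
```

The same pairwise-distance matrix underlies both linkages; only the reduction (min vs max) changes.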
6.3 Exercise
Using the “three_clusters_data” imported below, cluster the data with \(n\_clusters=3\) using both single and complete-linkage with the AgglomerativeClustering model. The code to run the clustering is the same format as previous methods. For both results, calculate the silhouette score. Which is higher? Why do you think this is?
from sklearn.cluster import AgglomerativeClustering
three_clusters_data = np.loadtxt("../../data/clusters.csv", delimiter=",")
X, true_labels = three_clusters_data[:,0:-1], three_clusters_data[:,-1]
#Lets first view our data
plt.scatter(x=X[:,0], y=X[:,1]);
#Create model for each linkage method
single_linkage_model = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)
complete_linkage_model = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
#Access labels assigned for each method
single_clusters = single_linkage_model.labels_
complete_clusters = complete_linkage_model.labels_
#Evaluate performance of each model and ground truth
single_silhouette = silhouette_score(X, single_clusters)
complete_silhouette = silhouette_score(X, complete_clusters)
true_silhouette = silhouette_score(X, true_labels)
print("Single score:", round(single_silhouette, 4))
print("Complete score:", round(complete_silhouette, 4))
print("'True' score:", round(true_silhouette, 4))
fig, axes = plt.subplots(figsize=(12,5), ncols=2)
axes[0].set_title("Single Linkage")
axes[0].scatter(x=X[:,0], y=X[:,1], c=single_clusters, cmap="viridis")
axes[1].set_title("Complete Linkage")
axes[1].scatter(x=X[:,0], y=X[:,1], c=complete_clusters, cmap="viridis");Single score: 0.2962
Complete score: 0.6208
'True' score: 0.6217
6.3.1 Average-Linkage
With single-linkage we use the distance between two single data points to determine whether to merge clusters. This has drawbacks, as our choice of merges will depend heavily on single data points, which may not reflect the clusters they represent.
One way of reducing the impact of individual data point variance when merging clusters is to use an average of distances.
Average-linkage is calculated by measuring the average distance between all points in a cluster, with all points in a different cluster.
For example, in clusters \(A\) and \(B\), for every point in \(A\) the distance is measured to every point in \(B\). The average distance between the clusters is then found by dividing the sum of these distances by the number of points in \(A\) times the number of points in \(B\).
The sizes (number of elements within) of clusters \(A\) and \(B\) are given by \(|A|\) and \(|B|\).
The clusters that are merged are then the two with the smallest average distance.
\[D(A,B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A}\sum_{b \in B} d(a,b)\]
Once the value of the average-linkage is found between all clusters in the data set, the two nearest are merged together.
Average-linkage is also called the Unweighted Pair Group Method with Arithmetic mean (UPGMA); there is an alternative method where a weighted average is used instead of an unweighted one (WPGMA).
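Following the formula above, the average linkage is simply the mean of all \(|A| \cdot |B|\) pairwise distances between the two clusters. A sketch, assuming Euclidean distance:

```python
import numpy as np

def average_linkage(A, B):
    # Mean of all |A| * |B| pairwise Euclidean distances between the clusters
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return dists.sum() / (len(A) * len(B))

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[4., 0.], [6., 0.]])
print(average_linkage(A, B))  # (4 + 6 + 3 + 5) / 4 = 4.5
```

Compared with single-linkage, no individual pair of points can dominate the result; every cross-cluster distance contributes equally.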
6.3.2 Ward-Linkage
A very commonly used method for Agglomerative Clustering is Ward’s method, one of the initial approaches taken in the field.
To determine the Ward-linkage between clusters the variance of data points in a cluster is used.
The variance \(\sigma^2\) within a group of points is the sum of squared distances between each point in the group and the mean point (centroid) of that group, divided by the group size; for \(A\), the centroid is \(\mu_A\).
\[\sigma^2(A) = \frac{\sum_{a \in A} (a - \mu_A)^2}{|A|} \]
The Ward-linkage selects the pair of clusters to merge which gives the minimum difference in variance between the original clusters and the new merged cluster. The hypothetical cluster of \(A\) and \(B\) merged is given by \(C = A \cup B\).
The difference in variance to be minimised is generally:
\[\Delta \sigma^2(A,B) = \sigma^2(C) - \sigma^2(A) - \sigma^2(B)\]
Ward’s distance is this difference in variance, defined more formally as:
\[D(A,B) = \frac{\sum_{c \in C} (c - \mu_C)^2}{|C|} - \frac{\sum_{a \in A} (a - \mu_A)^2}{|A|} - \frac{\sum_{b \in B} (b - \mu_B)^2}{|B|}\]
It’s important to note that to calculate the variance we are using the sum of squared distances. These could be squared Euclidean, Manhattan or other distances.
The Ward-linkage is calculated for all pairs of clusters, then the two clusters with the smallest Ward distance are merged, and the algorithm continues.
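Following the variance formulation above, the Ward distance can be sketched directly (note: library implementations typically use an equivalent but more efficient update formula rather than recomputing variances from scratch):

```python
import numpy as np

def variance(A):
    # Sum of squared Euclidean distances to the centroid, divided by cluster size
    return ((A - A.mean(axis=0)) ** 2).sum() / len(A)

def ward_linkage(A, B):
    # Change in variance caused by hypothetically merging A and B into C
    C = np.vstack([A, B])
    return variance(C) - variance(A) - variance(B)

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[4., 0.], [5., 0.]])
# var(A) = var(B) = 0.25, var(C) = 4.25, so D(A,B) = 3.75
print(ward_linkage(A, B))
```

A large Ward distance means merging the two clusters would make the result much more spread out than its parts, so such pairs are merged last.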
6.4 Exercise
Why might agglomerative clustering be slow / resource demanding to implement?
In a naive implementation of the hierarchical clustering described, the time taken to complete the modelling will increase significantly as more data is added.
When the algorithm starts, the distance from every single data point to every other data point needs to be calculated. If we have \(N\) data points, that is on the order of \(N^2\) distance calculations just for the first step.
For a few hundred/thousand data points, each with a couple of dimensions, this is easy for our computers. However, as we increase the dimensionality of our data (columns), and the number of points (rows), the number of calculations increases rapidly.
Thankfully, most clustering algorithms do not need to calculate every single step and can take some shortcuts to bypass some computing effort. The more shortcuts we make however, the higher the potential for less accurate results.
As a result of the number of calculations, some algorithms can become slow or impossible to run on large data.
For hierarchical clustering we need to store information about the clustering at each step as each merge occurs, this can take up a large amount of computer memory.
6.4.1 Comparing Linkage Methods
Using the code below we are going to produce clustering based on a number of different data sets to see the differences between linkage methods.
A range of data sets are used, as they each have unique properties that can be a challenge to model, however, this is still toy data designed for this purpose.
You will not know what type of linkage is appropriate until you investigate your data! For this visualisation’s purpose we are fixing the number of clusters within each data set, to compare each method equally.
Each type of clustering algorithm on each data set is evaluated, in this case extrinsically using the Adjusted Rand Index (ARI). This checks the found group membership against the true labels. A value of 0 represents near-random clustering; a value of 1, perfect labelling.
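As a quick illustration of the ARI's behaviour (the labels below are made up): it ignores the actual label values and only measures agreement in group membership, which is why it suits clustering, where label names are arbitrary.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]
# The same grouping with different label names still scores perfectly
renamed     = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_score(true_labels, renamed))  # 1.0

# Random assignments score close to 0
rng = np.random.default_rng(0)
labels_a = rng.integers(0, 2, size=1000)
labels_b = rng.integers(0, 2, size=1000)
print(round(adjusted_rand_score(labels_a, labels_b), 3))
```

The "adjusted" part corrects for chance agreement, which is what keeps random labelings near 0 regardless of the number of clusters.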
There is quite a bit of code below, it is used to produce the visualisation and does not need to be understood in depth.
# Set up data and parameters to use
# Load each data set
circles = np.loadtxt("../../data/circles_simplified.csv", delimiter=",")
three_clusters_mixed = np.loadtxt("../../data/clusters_density.csv", delimiter=",")
three_clusters = np.loadtxt("../../data/clusters.csv", delimiter=",")
half_moons = np.loadtxt("../../data/halfmoons.csv", delimiter=",")
# Combine the data sets with corresponding names into lists
data_sets = [circles, three_clusters, three_clusters_mixed, half_moons]
data_set_names = ["Concentric Circles", "Equal Density Clusters", "Mixed Density Clusters", "Half-Moons"]
# Define the sklearn appropriate linkage functions to use
linkages = ["single", "complete", "average", "ward"]
# Select the chosen metric to evaluate distances
distance_metric = "euclidean"
from sklearn.cluster import AgglomerativeClustering
# Define figure and iteratively create subplots
# Create the required number of subplots
fig, ax = plt.subplots(figsize=(10, 10), ncols=len(data_sets), nrows=len(linkages))
# Space the plots appropriately
plt.subplots_adjust(wspace=0.3, hspace=0.3)
fig.suptitle(t="Variety of linkage methods on diverse data patterns", y=0.94, fontsize=14)
# Iterate over the data sets, each data set is a column of the final plots
for column_index, data in enumerate(data_sets):
# Label the data set columns
ax[0, column_index].set_title(data_set_names[column_index])
# Iterate over the linkage functions, each function is a row of the final plots
for row_index, linkage_function in enumerate(linkages):
# Label the linkage rows
if column_index == 0:
ax[row_index, 0].set_ylabel(linkages[row_index], fontsize=12)
# Split the raw data between data and labels
X, y = data[:,:-1], data[:,-1]
# Get the number of labels in the data set for clustering
n_labels = len(np.unique(y))
# Fit the clusterer on the data using the linkage function
clusterer = AgglomerativeClustering(n_clusters=n_labels,
linkage=linkage_function,
metric=distance_metric).fit(X)
# Collect the labels and score the clusterers performance
labels = clusterer.labels_
score = adjusted_rand_score(y, labels)
# Plot the clusterers found clusters on the data
ax[row_index, column_index].scatter(x=X[:,0], y=X[:,1],
c=labels, cmap="jet", s=12,
label="ARI: {}".format(round(score, 2)))
# Display the ARI on each plot
ax[row_index, column_index].legend(markerscale=0, handlelength=0, fontsize=8)
There is a lot going on in this figure, so we are going to break it down by data pattern. A reminder: these are toy data sets representative of potential situations, so investigate your own data with a range of methods!
Concentric Circles
As shown in Chapter 4: Dimensionality Reduction, circular data can be a challenge to model. In clustering this is no different, many algorithms struggle to pick up the label pattern.
The exception is the single-linkage method, which labels the different circles as different clusters. It is able to do this as the method favours small distances within clusters, and large gaps between different clusters. The algorithm effectively only looks nearby to merge clusters and does not care what the shape of the result looks like - perfect for this application.
The complete, average and ward linkages perform poorly as demonstrated visually, and by the low ARI scores. Each favours a compact shape to separate groups, which fails when the data is not compact or linearly separable in euclidean space. Consider the average-linkage, the desired average position of both clusters is in the same place right in the center, making it challenging for this method to separate the groups.
Equal Density Clusters
This data pattern is a simple, three cluster data set with normally distributed groups surrounding each centre. This is one of the “easiest” patterns for the majority of clusters to recognise, as each group is compact, and separated.
In an opposite trend to the Concentric Circles, the single-linkage method performs poorly on this data. The ARI of 0.51 is poor compared to the other methods. The clustering groups two of the three clusters correctly; however, it does not pick up the third, as one single point (in green) lies further from any cluster than the true third cluster does from its nearest neighbour.
The complete, average and ward linkages perform well on this simple grouping. The only difference between the methods is one single extra point being correctly grouped in the average linkage, can you find it?
This data has a relatively small variance around each centre, allowing the ward and average methods to be effective. As the data is separated well, the complete linkage is also appropriate. When we look at “less clean” groupings of data however, we will see that not all these linkages perform well.
Mixed Density Clusters
Compared to the equal density clusters, this data has higher variance around each center, as well as one cluster with a lower density of points. This makes hierarchical clustering, as with other methods, more challenging.
Looking at the results and ARI values, it is clear the methods range from very bad (single-linkage), to okay (complete and average-linkage) up to good (ward-linkage).
- Single-linkage fails to group the data compactly, due to outliers at higher distances.
- Complete-linkage determines that there are three clusters, in roughly the right locations; however, as it joins clusters with the smallest maximum distance, it struggles with closely grouped clusters.
- Average-linkage locates two large clusters, with one dominating what is truly the third label.
- Ward-linkage performs well in this situation. This is because the linkage is a variance minimisation method, and the data is normally distributed.
Half-Moons
This data set follows a similar trend to that of the concentric circles. As the true groupings are not compact, complete, average and ward linkages fail to pick up on the two groups.
Single-linkage performs well in this situation, as the clusters are separated adequately, dense and long.
6.4.2 Different Methods
As with all modelling, there are no 100% correct answers when it comes to selecting a technique. However, as you can see above, some methods are more applicable in some situations.
It’s important to note that we have used relatively cleanly separated, small, low-dimensional data, which may not always be the case.
6.5 Dendrograms
A dendrogram is a type of visualisation that helps us to understand how our hierarchical clustering is working. It shows us which clusters are joined in what order, and what the intermediate clusters contain.
On the x-axis are the data points themselves. For large numbers of data points this is sometimes instead a count of points.
On the y-axis is the distance that was measured between clusters before a merge.
The graph itself shows at what distance each cluster was merged. Starting at the bottom, all data points are separate clusters, until the agglomerative clustering is complete and the final two clusters are combined at the top link.
Each merge of two clusters is denoted by a horizontal line, a link.
The Data
In the figure on the left we can see our data with the true class labels. What we can see from this data is:
- There are 10 points in two dimensions, labelled 0-9.
- The data points are irregularly spaced, some being closer than others.
- Points 3 and 6 are very close (in euclidean space!), as are 2 and 9.
- Even if we disregard the true label values, we can see there are three natural groupings of the points.
The Graph
The dendrogram shows the result of average linkage hierarchical clustering using a euclidean metric on the data set.
Starting at the bottom is each of the individual data points.
We look at the figure from bottom to top.
The first event that occurs is that points 3 and 6 are merged. This is because they have the smallest average linkage (single point clusters). We can see this as the distance on the y-axis where they are merged is small.
Next, further up the figure, the second nearest points 2 and 9 are merged into one cluster. Again, we can see the relatively small distance between the two (single point) clusters on the y-axis.
We can count how many clusters there are at any point by “drawing” a line across the dendrogram, the number of intersections with the cluster lines is the number of clusters.
At the bottom of the graph there are 10 intersections, as we go up there are fewer and fewer as the clusters merge. At the top we have two, then none as the final link is made, indicating there is one final cluster.
Looking at the dendrogram there is a large distance (y-axis) between the merging of the red and cyan clusters. There is also a significant distance between the green cluster being merged with the other two.
This large distance between merges tells us that the clusters are not likely to be well grouped together.
Along with other methods of evaluation (intrinsic and extrinsic), we can use dendrograms to analyse our clustering algorithm. Although, we should avoid using the graph to determine the number of clusters we want to use in isolation.
6.6 Implementation
This course has broadly taken the approach of using one single library for all machine learning, for simplicity's sake, as sklearn contains most of the algorithms required. However, hierarchical clustering and dendrograms are areas where sklearn is not fully equipped for our purposes.
For this reason we will briefly dip our toes into the scientific computing library scipy. This library is heavily used by sklearn, and is maintained by the same group responsible for numpy. As a result, much of the syntax and data structure will be familiar.
The main difference in clustering between scipy and sklearn is that scipy is not object oriented in design, we will not create a model object, fit it and predict with one class. Instead, we call functions on our data, which produce new numpy data. This gives us more flexibility in the processing we want to complete, but does have a slightly steeper learning curve. For our purposes in this course, this will be kept to a minimum.
As with numpy, I would strongly recommend becoming familiar with scipy as it is frequently used in python statistics and machine learning.
6.6.1 Using scipy for clustering
Below we are going to implement the same clustering as we did in Exercises 1 & 6, this time using scipy in two different ways.
# Load the data
three_clusters_data = np.loadtxt("../../data/clusters.csv", delimiter=",")
X, labels_true = three_clusters_data[:,0:-1], three_clusters_data[:,-1]
print("X shape:", X.shape)
print("True label shape:", labels_true.shape)
print("True label values:", np.unique(labels_true))
X shape: (300, 2)
True label shape: (300,)
True label values: [0. 1. 2.]
Firstly, we implement the single-linkage method by generating a linkage matrix (a record of which clusters merge, the distance between them, and the number of points involved at each iteration of merging).
From that matrix we then determine our actual cluster labels, depending on how we want to approach this. As well as selecting how many clusters we want, we could also specify a threshold distance.
from scipy.cluster.hierarchy import linkage, fcluster
# Returns a linkage matrix from which to derive our results
single_linkage_matrix = linkage(X, 'single')
single_linkage_matrix
array([[2.28000000e+02, 2.49000000e+02, 2.31159979e-03, 2.00000000e+00],
[2.43000000e+02, 3.00000000e+02, 5.92957337e-03, 3.00000000e+00],
[1.79000000e+02, 2.79000000e+02, 1.25142456e-02, 2.00000000e+00],
...,
[5.92000000e+02, 5.94000000e+02, 3.46283784e-01, 2.02000000e+02],
[5.95000000e+02, 5.96000000e+02, 3.59001299e-01, 2.99000000e+02],
[7.40000000e+01, 5.97000000e+02, 3.86136115e-01, 3.00000000e+02]])
single_linkage_matrix.shape
(299, 4)
The meaning of the linkage matrix is as follows:
- Each row of the matrix corresponds to one step in the iteration of the clustering.
- The first and second columns Z[i,0], Z[i,1] are the indexes of the two clusters being merged.
- The third column is the distance between clusters Z[i,0] and Z[i,1].
- The fourth column is the number of original data points contained in the newly formed cluster (the merge of Z[i,0] and Z[i,1]).
There are \(n-1\) rows in the linkage matrix because there are \(n-1\) steps to combine \(n\) data points.
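As a quick check of this, a hand-made example with four points (values chosen purely for illustration) gives a linkage matrix with three rows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of points, far apart from each other
tiny_X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
tiny_Z = linkage(tiny_X, 'single')

# n = 4 points -> n - 1 = 3 merge steps, each row holding
# [cluster index, cluster index, merge distance, point count]
print(tiny_Z.shape)   # (3, 4)
```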
From this linkage matrix, created using our selected linkage method, we can find our desired clustering.
# Flatten the linkage matrix to find the clusters,
# In this case specify how many clusters `t` we want.
found_single_labels = fcluster(single_linkage_matrix,
criterion="maxclust",
t=3)
From our linkage matrix we have all possible clusterings using the specified linkage method. We need to specify a criterion and threshold in order to produce one specific clustering. This is done by specifying the threshold value t (it can refer to a count, a distance or another quantity) and the criterion to measure by. In the example above "maxclust" is used: this gives us the clustering from the linkage matrix that has t clusters. For other options see the documentation. Another common criterion is "distance".
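As a sketch of the "distance" criterion, here we cut the tree at a chosen merge height instead of asking for a fixed number of clusters. The data and the threshold value are illustrative, not taken from the chapter's dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs, so a small distance threshold splits them cleanly
X_demo = np.vstack([rng.normal(0, 0.05, size=(20, 2)),
                    rng.normal(4, 0.05, size=(20, 2))])
Z_demo = linkage(X_demo, 'single')

# Every merge below distance t=0.5 stays within one flat cluster;
# only the large between-blob merge exceeds the threshold
labels_by_distance = fcluster(Z_demo, criterion="distance", t=0.5)
print(np.unique(labels_by_distance))
```

This is useful when we care about a physical or domain-meaningful separation distance rather than a preset cluster count.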
To implement the complete-linkage method we are going to add in an intermediate step, generating a condensed distance matrix before computing the linkage. This is a more memory efficient method of storing the data, which also allows for faster clustering.
Whilst not necessary for our purposes at this time, it’s an interesting look behind the curtain of what is happening in our code.
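To see what the condensed form holds, here is a tiny illustrative example: pdist returns the \(n(n-1)/2\) pairwise distances as a flat vector, and squareform (also in scipy.spatial.distance) expands it back into the full \(n \times n\) symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three points forming a 3-4-5 right triangle
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

# Flat vector of the upper-triangle distances, in order d(0,1), d(0,2), d(1,2)
condensed = pdist(pts, metric="euclidean")
print(condensed)              # [3. 4. 5.]

# Expand back to the full symmetric matrix with a zero diagonal
print(squareform(condensed))
```

Storing only the upper triangle halves the memory cost, which matters because the pairwise distance matrix grows quadratically with the number of data points.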
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import complete
# Calculate condensed distance matrix
condensed_matrix = pdist(X, metric="euclidean")
# Calculate linkage matrix
complete_linkage_matrix = complete(condensed_matrix)
# Flatten linkage matrix using given criteria
found_complete_labels = fcluster(complete_linkage_matrix,
criterion="maxclust",
t=3)
We can see the results of the two linkage methods below, the same as shown with sklearn.
fig, ax = plt.subplots(figsize=(8,4), ncols=2, sharey=True)
# Plot single linkage
ax[0].scatter(x=X[:,0], y=X[:,1], c=found_single_labels, cmap="jet")
ax[0].set_title("Single-Linkage")
# Plot complete linkage
ax[1].scatter(x=X[:,0], y=X[:,1], c=found_complete_labels, cmap="jet")
ax[1].set_title("Complete-Linkage");
6.7 Exercise
Using the "three_clusters_data" imported below, cluster the data with \(n\_clusters=3\) using single linkage and the scipy package.
Calculate the silhouette score of this clustering.
three_clusters_data = np.loadtxt("../../data/clusters.csv", delimiter=",")
X, labels_true = three_clusters_data[:,0:-1], three_clusters_data[:,-1]
#Calculate the linkage matrix
single_linkage_matrix = linkage(X, 'single')
#Flatten the linkage matrix to the required number of clusters
found_single_labels = fcluster(single_linkage_matrix,
criterion="maxclust", t=3)
from sklearn.metrics import silhouette_score
single_score = silhouette_score(X, found_single_labels)
print("Single Linkage silhouette score", round(single_score, 4))
Single Linkage silhouette score 0.2962
6.7.1 Plotting Dendrograms
Plotting a dendrogram when we have a linkage matrix is simple. However, doing so without one, using only the sklearn model objects, is much more challenging.
For this reason I would suggest working in scipy when using hierarchical clustering. Below we are going to take a subset of data and produce a dendrogram using average-linkage.
We will also show how to plot many data points with this graph style.
# Load the data
three_clusters_data = np.loadtxt("../../data/clusters.csv", delimiter=",")
# Set a fixed random seed for reproducibility
np.random.seed(42)
number_of_points = 10
# Get a subset of the original data
three_cluster_subset = three_clusters_data[np.random.choice(three_clusters_data.shape[0],
number_of_points,
replace=False), :]
X, labels_true = three_cluster_subset[:,0:-1], three_cluster_subset[:,-1]
from scipy.cluster.hierarchy import dendrogram
# Returns a linkage matrix from which to derive our results
average_linkage_matrix = linkage(X, 'average')
# Create the dendrogram object, this is automatically plotted for us in Jupyter Notebooks
average_dendrogram = dendrogram(average_linkage_matrix)
When we have more data, however, we can't plot all of the leaves in our dendrogram, as the figure would have as many lines as we have rows of data.
Instead, we can show the counts of data points below a specified merge; the counts are shown in brackets. We call cutting off our figure "truncating". There are two main options to truncate the graph: by specifying a number of levels ("level"), or the number of bottom-level clusters we want shown ("lastp").
The threshold for each mode is given by p.
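For comparison with the "lastp" example that follows, here is a sketch of the "level" mode on illustrative random data: p limits how many levels of merges are drawn from the top, with everything deeper collapsed into counted leaves:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
Z_demo = linkage(X_demo, 'average')

# Show only the top 3 levels of merges; deeper structure is collapsed,
# with the number of hidden points shown in brackets at each leaf
truncated = dendrogram(Z_demo, truncate_mode="level", p=3,
                       show_leaf_counts=True)
print(len(truncated["ivl"]))   # number of leaves actually drawn
```

"level" is useful when we care about the shape of the top of the tree, whereas "lastp" is useful when we want a fixed number of leaf groups regardless of tree depth.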
# Specify a larger number of data points
number_of_points = 200
# Get a random subset of rows by generating uniform integers
three_cluster_subset = three_clusters_data[np.random.choice(three_clusters_data.shape[0],
number_of_points,
replace=False), :]
X, labels_true = three_cluster_subset[:,0:-1], three_cluster_subset[:,-1]
# Returns a linkage matrix from which to derive our results
average_linkage_matrix = linkage(X, 'average')
# Create the dendrogram object, show the counts and specify the method of truncation
average_dendrogram = dendrogram(average_linkage_matrix,
show_leaf_counts=True,
p=10,
truncate_mode="lastp")
6.7.2 Exercise
Plot a dendrogram using the data loaded below, previously used in exercise one, and Ward linkage.
Looking at the dendrogram, what appears to be a sensible number of clusters to take?
Calculate the silhouette score for this number of clusters.
#Import and separate data
unknown_clusters_data = np.loadtxt("../../data/exercise_one_clusters.csv", delimiter=",")
X, _true_labels = unknown_clusters_data[:,0:-1], unknown_clusters_data[:,-1]
#Calculate the linkage matrix
ward_linkage_matrix = linkage(X, 'ward')
#Create the dendrogram object, show the counts and specify the method of truncation
ward_dendrogram = dendrogram(ward_linkage_matrix,
show_leaf_counts=True,
p=10,
truncate_mode="lastp")
plt.title("Dendrogram of Ward Linkage")
plt.ylabel("Merge Distance");
#Flatten the linkage matrix to the required number of clusters
found_ward_labels = fcluster(ward_linkage_matrix,
criterion="maxclust", t=5)
#Calculate the performance
ward_score = silhouette_score(X, found_ward_labels)
print("Ward silhouette score:", round(ward_score, 4))
#Loop through a range of number of clusters
for number_of_clusters in range(2, 9):
# Flatten the linkage matrix to the specified number of clusters
found_ward_labels = fcluster(ward_linkage_matrix,
criterion="maxclust",
t=number_of_clusters)
ward_score = silhouette_score(X, found_ward_labels)
print("{} clusters, score: {:.4}".format(number_of_clusters, ward_score))
Ward silhouette score: 0.5724
2 clusters, score: 0.4633
3 clusters, score: 0.4326
4 clusters, score: 0.5263
5 clusters, score: 0.5724
6 clusters, score: 0.5228
7 clusters, score: 0.5221
8 clusters, score: 0.4787
7 Summary
In this chapter we have looked at how to perform clustering on a data set in order to understand the natural groupings.
To do this we first explored how distances can be measured in data, looking at what defines a metric. Following from this, the KMeans algorithm was worked through, showing how it groups data, and how to implement this method in sklearn.
As with other areas of Machine Learning, the evaluation of the models built is crucial; for this reason we looked at both extrinsic and intrinsic evaluation, for when we do or do not have true labels.
KMeans is not the only clustering algorithm: we also looked at density-based clustering methods and hierarchical clustering, which give us the flexibility to model different data distributions. Each method has benefits and good use cases, as well as drawbacks. It's important to understand and get a feel for when each might be appropriate.
Lastly we looked at plotting the hierarchical clustering methods, using these dendrograms to analyse the cluster structures.
Much of the data we have looked at in this session has been synthetic, created to illustrate specific points. It is important to get practice with “real” data, especially in your domain area. Clustering is a useful analysis tool to understand data, but fundamentally the insight gained will always be dependent on the algorithm chosen.
The clustering methods looked at in this chapter have all been deterministic. There are, however, many commonly used stochastic clustering methods; these were viewed as out of scope for this introductory level course. An outline of them and further resources are given in the Further Reading section.
For further practice in this area work through the associated case studies.
8 Further Reading
An alternative method for determining the number of clusters present in a data set is the gap statistic. Here is an implementation of this measurement (there is not currently a major package which supports this method). The following link is to the paper which introduced the gap statistic.
Agglomerative clustering can be improved significantly in some instances by adding connectivity constraints to the algorithm, imposing local structure and speeding up calculations. An example on image data is given in the sklearn documentation. This paper discusses the impact of connectivity constraints.
Gaussian Mixture models are another form of unsupervised learning which can be used for clustering. An example implementation and theory is given on this website, and here is the sklearn documentation.
For more information and a tutorial on scipy have a look at their reference guide.
8.1 Stochastic Methods
The clustering methods we have looked at so far are deterministic, in that given the same starting conditions they will return the same clusters. This approach is bottom-up: given the data and the method, the clusters are found without knowledge of the overall characteristics of the data.
This approach can yield good results. However, we are not making any assumptions about the structure of our data.
Stochastic methods, primarily Latent Class Analysis (LCA) based models, allow us to leverage what we know about the distributions of our data. LCA falls under general Finite Mixture Models (FMMs), which use probabilistic models. Instead of measuring distances between points (clustering), FMMs can use the probabilities of a record being a member of a class based on the distributions given.
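As a brief illustration of the probabilistic idea, sklearn's GaussianMixture (one widely used finite mixture model) returns a probability of membership for each component rather than only a hard assignment. The data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two Gaussian blobs, mimicking data drawn from a two-component mixture
X_mix = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                   rng.normal(4, 0.5, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_mix)

# Soft assignment: each row gives the probability of membership in each
# component, and each row sums to 1
probabilities = gmm.predict_proba(X_mix)
# Hard assignment: the argmax of those probabilities
hard_labels = gmm.predict(X_mix)
print(probabilities.shape)
```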
For more information on LCA please see: